A Feature Induction Algorithm with Application to Named Entity Disambiguation
Abstract
The performance of NLP classifiers depends largely on the quality of the features considered for prediction (feature engineering). However, as the number of features increases, overfitting becomes more likely and performance decreases. Also, because the number of features is very large, only simple linear classifiers are considered, which disregards potentially predictive non-linear combinations of features. Here we propose an automated method for feature induction that selects and adds to the model the features and feature combinations that are likely to be useful for prediction. The resulting model relies on a smaller feature set, is non-linear, and is more accurate than the baseline, which is the model trained on the entire feature set. The method uses a greedy filtering approach based on various univariate measures of feature relevance and is very fast in practice. Our feature induction method is also independent of the classifier used: we applied it with Naïve Bayes and Perceptron models.
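The greedy, univariate filtering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which relevance measures or acceptance rules the authors use, so everything here is an assumption made for illustration: binary features, information gain as the univariate measure, a hypothetical `top_k` cutoff, and a rule that admits a pairwise conjunction only when it scores higher than both of its parent features. The function names `info_gain` and `induce_features` are likewise hypothetical.

```python
import math
from itertools import combinations


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def info_gain(column, labels):
    """Information gain of one binary feature column with respect to the labels.

    Assumption: information gain stands in for the unspecified
    'univariate measures of feature relevance'.
    """
    n = len(labels)
    on = [y for x, y in zip(column, labels) if x]
    off = [y for x, y in zip(column, labels) if not x]
    conditional = (len(on) / n) * entropy(on) + (len(off) / n) * entropy(off)
    return entropy(labels) - conditional


def induce_features(columns, labels, top_k=2, min_gain=1e-9):
    """Greedy filter: keep the top_k single features, then add conjunctions.

    columns: dict mapping feature name -> list of 0/1 values.
    A conjunction of two kept features is admitted only if its score
    exceeds the scores of both parents (an illustrative rule, not the
    paper's stated criterion).
    """
    scores = {name: info_gain(col, labels) for name, col in columns.items()}
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    kept = [name for name in ranked[:top_k] if scores[name] > min_gain]
    selected = {name: columns[name] for name in kept}
    for a, b in combinations(kept, 2):
        conjunction = [x & y for x, y in zip(columns[a], columns[b])]
        if info_gain(conjunction, labels) > max(scores[a], scores[b]):
            selected[f"{a}&{b}"] = conjunction
    return selected


# Toy data: the label is the AND of features a and b; c is uninformative.
columns = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
    "c": [0, 0, 0, 0, 1, 1, 1, 1],
}
labels = [0, 0, 0, 1, 0, 0, 0, 1]
selected = induce_features(columns, labels)
print(list(selected))
```

On this toy data the filter drops the uninformative feature `c` and admits the conjunction of `a` and `b`, which a linear model over the raw features could not represent as a single term. The selected columns can then be passed to any classifier, in line with the abstract's claim of classifier independence.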
Similar resources
Improving Persian Named Entity Recognition Using Kasreh Ezafe
Named entity recognition is the process of identifying the names of people, places (cities, countries, seas, etc.), and organizations (public and private companies, international institutions, etc.), as well as dates, currencies, and percentages in a text. Named entity recognition plays an important role in many NLP tasks such as semantic role labeling, question answering, summarization, machine ...
Entity Disambiguation with Linkless Knowledge Bases
Named Entity Disambiguation is the task of disambiguating named entity mentions in natural language text and linking them to their corresponding entries in a reference knowledge base (e.g. Wikipedia). Such disambiguation can help add semantics to plain text and distinguish homonymous entities. Previous research has tackled this problem by making use of two types of context-aware features derived f...
Chinese Name Disambiguation Based on Adaptive Clustering with the Attribute Features
A Chinese name disambiguation method based on adaptive clustering with attribute features is proposed for the CLP2012 evaluation task on named entity recognition and disambiguation in Chinese. First, 12-dimensional character attribute features are defined, and a tagged attribute feature corpus is used to train a recognition model for attribute features by Conditional Rando...
Web Person Name Disambiguation by Relevance Weighting of Extended Feature Sets
This paper describes our approach to the Person Name Disambiguation clustering task in the Third Web People Search Evaluation Campaign (WePS3). The method focuses on two aspects: extended feature sets and feature relevance weighting. Bag-of-words and named entities are the most commonly used features in many existing web entity disambiguation algorithms, and we further extend this basic feature ...
High Performance Clustering for Web Person Name Disambiguation Using Topic Capturing
Searching for named entities is a common task on the web. Among different named entities, person names are among the most frequently searched terms. However, many people can share the same name and the current search engines are not designed to identify a specific entity, or a namesake. One possible solution is to identify a namesake through clustering webpages for different namesakes. In this ...